Learning-from-Observation (LfO) is a robot teaching framework for programming operations through few-shot human demonstration. While most previous LfO systems run with visual demonstration, recent research on robot teaching has shown the effectiveness of verbal instruction in making recognition robust and teaching interactive. To the best of our knowledge, however, few solutions have been proposed for LfO that utilizes verbal instruction, namely multimodal LfO. This paper proposes a practical pipeline for multimodal LfO. For input, a user temporarily stops hand movements to match the granularity of human instructions with the granularity of robot execution. The pipeline recognizes tasks based on step-by-step verbal instructions accompanied by demonstrations. In addition, the recognition is made robust through interactions with the user. We test the pipeline on a real robot and show that the user can successfully teach multiple operations from multimodal demonstrations. The results suggest the utility of the proposed pipeline for multimodal LfO.
Robot developers build various types of robots to satisfy users' varied demands. These demands depend on users' backgrounds, so the robot best suited to each user may differ. If a developer offers a user a robot that differs from the usual one, the robot-specific software has to be changed. On the other hand, robot-software developers would like to reuse their software as much as possible to reduce their effort. We propose a system design that considers hardware-level reusability. For this purpose, we begin with the learning-from-observation framework. This framework represents a target task in a robot-agnostic representation, and thus the resulting task description can be shared among various robots. When executing the task, it is necessary to convert the robot-agnostic description into commands for a target robot. To increase reusability, first, we implement the skill library (robot motion primitives) considering only a robot hand, and we regard the robot as just a carrier that moves the hand along the target trajectory. The skill library is reusable as long as the same robot hand is used. Second, we employ a generic IK solver so that robots can be swapped quickly. We verify the hardware-level reusability by applying two task descriptions to two different robots, Nextage and Fetch.
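Mapping spoken text to gestures is an important research topic for robots with conversational capabilities. According to studies of human co-speech gestures, a reasonable solution for this mapping is a concept-based approach, in which a text is first mapped to a semantic cluster (i.e., a concept) containing texts with similar meanings. Each concept is then mapped to a predefined gesture. Using a concept-based approach, this paper discusses the practical issue of obtaining concepts for the unique vocabulary of a conversational agent. Using Microsoft Rinna as the agent, we compare concepts obtained automatically through a natural language processing (NLP) approach with concepts obtained through a sociological approach. We then identify three limitations of the NLP approach: at the semantic level with emojis and symbols; at the semantic level with slang, new words, and buzzwords; and at the pragmatic level. We attribute these limitations to Rinna's personalized vocabulary. A follow-up experiment shows that robot gestures selected using the concept-based approach leave a better impression than randomly selected gestures for the Rinna vocabulary, which suggests the usefulness of a concept-based gesture generation system for personalized vocabularies. This study provides insights into the development of gesture generation systems for conversational agents with personalized vocabularies.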
Classification bandits are multi-armed bandit problems whose task is to classify a given set of arms into either a positive or a negative class, depending on whether the rate of arms with an expected reward of at least h is not less than w, for given thresholds h and w. We study a special classification bandit problem in which arms correspond to points x in d-dimensional real space with expected rewards f(x) that are generated according to a Gaussian process prior. We develop a framework algorithm for the problem using various arm selection policies and propose policies called FCB and FTSV. We show a smaller sample complexity upper bound for FCB than for the existing level set estimation algorithm, which must decide whether f(x) is at least h for every arm x. Arm selection policies that depend on an estimated rate of arms with rewards of at least h are also proposed and shown to improve empirical sample complexity. According to our experimental results, the rate-estimation versions of FCB and FTSV, together with that of the popular active learning policy that selects the point with the maximum variance, outperform other policies for synthetic functions, and the rate-estimation version of FTSV is also the best performer for our real-world dataset.
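As a rough illustration of the target labeling only, the classification rule described above can be sketched in Python. The function name `classify_arm_set` is hypothetical, and the sketch assumes the expected rewards are already known; in the actual bandit setting they are unknown and must be estimated from samples, and the arm selection policies FCB and FTSV are not reproduced here.

```python
from typing import Sequence


def classify_arm_set(expected_rewards: Sequence[float], h: float, w: float) -> str:
    """Label a set of arms from their (assumed known) expected rewards.

    The set is 'positive' when the rate of arms whose expected reward
    is at least h is not less than w, and 'negative' otherwise.
    """
    if len(expected_rewards) == 0:
        raise ValueError("need at least one arm")
    # Count arms that clear the reward threshold h.
    good_arms = sum(1 for mu in expected_rewards if mu >= h)
    # Compare the rate of such arms with the rate threshold w.
    rate = good_arms / len(expected_rewards)
    return "positive" if rate >= w else "negative"
```

For example, with expected rewards `[0.9, 0.2, 0.7, 0.1]` and `h = 0.5`, two of the four arms clear the threshold, so the set is positive for `w = 0.5` and negative for `w = 0.6`.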
Esports, a form of sports competition using video games, has become one of the most important sporting events in recent years. Although the amount of esports data is growing faster than ever, only a small fraction of those data is accompanied by text commentaries that let the audience retrieve and understand the plays. To address this problem, we introduce the task of generating game commentaries from structured data records. We first build a large-scale esports data-to-text dataset using structured data and commentaries from a popular esports game, League of Legends. On this dataset, we devise several data preprocessing methods, including linearization and data splitting, to improve its quality. We then introduce several baseline encoder-decoder models and propose a hierarchical model to generate game commentaries. Considering the characteristics of esports commentaries, we design evaluation metrics that cover three aspects of the output: correctness, fluency, and strategic depth. Experimental results on our large-scale esports dataset confirm the advantage of the hierarchical model and reveal several challenges of this novel task.
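Linearization, in general, means flattening a structured record into a token sequence that an encoder-decoder model can read. The following is a minimal sketch of that general idea, not the preprocessing used in the paper: the function name `linearize`, the `<field> value` format, and the example field names are all assumptions made for illustration.

```python
from typing import Mapping


def linearize(record: Mapping[str, object]) -> str:
    """Flatten one structured record into a single string.

    Each field becomes a '<field> value' pair, and pairs are joined
    with spaces in the record's own field order (hypothetical format).
    """
    parts = []
    for field, value in record.items():
        parts.append(f"<{field}> {value}")
    return " ".join(parts)


# Hypothetical record; the field names are illustrative only.
example = {"event": "kill", "killer": "Ahri", "victim": "Zed"}
linearized = linearize(example)
```

Here `linearized` is the string `<event> kill <killer> Ahri <victim> Zed`, which could then be tokenized like ordinary text.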
Human pose estimation, particularly in athletes, can help improve their performance. However, this estimation is difficult with existing methods, such as human annotation, if the subjects wear loose-fitting clothes such as ski or snowboard wear. This study developed a method for obtaining ground truth data on the two-dimensional (2D) poses of a human wearing loose-fitting clothes. The method uses fast-flashing light-emitting diodes (LEDs). The subjects were required to wear loose-fitting clothes and place the LEDs on the target joints. By selecting thin, filmy loose-fitting clothes, the LEDs could be observed directly with a camera. The proposed method captures the scene at 240 fps using a high-frame-rate camera and renders two 30 fps image sequences by extracting LED-on and LED-off frames. The temporal difference between the two video sequences can be ignored, considering the speed of human motion. The LED-on video was used to manually annotate the joints and thus obtain the ground truth data. Additionally, the LED-off video, equivalent to a standard video at 30 fps, was used to assess the accuracy of existing machine learning-based methods and manual annotations. Experiments demonstrated that the proposed method can obtain ground truth data for standard RGB videos. Further, they revealed that neither manual annotation nor the state-of-the-art pose estimator obtains the correct position of the target joints.
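The step of rendering two 30 fps sequences from a 240 fps capture can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the blink pattern is not specified in the abstract, so the sketch assumes a per-frame on/off flag is already available (for example from a brightness check) and simply picks one LED-on and one LED-off frame from each block of 240 / 30 = 8 captured frames. The function name `split_led_sequences` is hypothetical.

```python
from typing import List, Sequence, Tuple


def split_led_sequences(
    led_on_flags: Sequence[bool],
    capture_fps: int = 240,
    output_fps: int = 30,
) -> Tuple[List[int], List[int]]:
    """Pick frame indices for an LED-on and an LED-off sequence.

    Assumption: led_on_flags[i] says whether the LEDs are lit in
    captured frame i. Each block of capture_fps // output_fps frames
    contributes its first lit frame and its first unlit frame.
    """
    if capture_fps % output_fps != 0:
        raise ValueError("capture_fps must be a multiple of output_fps")
    block = capture_fps // output_fps
    on_indices: List[int] = []
    off_indices: List[int] = []
    for start in range(0, len(led_on_flags) - block + 1, block):
        window = [bool(flag) for flag in led_on_flags[start:start + block]]
        # Skip blocks lacking either state so both sequences stay aligned.
        if True not in window or False not in window:
            continue
        on_indices.append(start + window.index(True))
        off_indices.append(start + window.index(False))
    return on_indices, off_indices
```

With a hypothetical pattern of four lit frames followed by four unlit frames, the paired frames are at most a few capture intervals apart (4/240 s in this pattern), which is consistent with the abstract's remark that the temporal difference between the two sequences is negligible for human motion.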
Generative models, particularly GANs, have been utilized for image editing. Although GAN-based methods perform well at generating reasonable content aligned with the user's intentions, they struggle to strictly preserve the content outside the editing region. To address this issue, we use diffusion models instead of GANs and propose a novel image-editing method based on pixel-wise guidance. Specifically, we first train pixel classifiers with a small amount of annotated data and then estimate the semantic segmentation map of a target image. Users then manipulate the map to indicate how the image should be edited. The diffusion model generates an edited image under the guidance of the pixel-wise classifiers, such that the resulting image aligns with the manipulated map. Because the guidance is applied pixel-wise, the proposed method can create reasonable content in the editing region while preserving the content outside this region. The experimental results validate the advantages of the proposed method both quantitatively and qualitatively.
Different machine learning (ML) models are trained on SCADA and meteorological data collected at an onshore wind farm and then assessed in terms of fidelity and accuracy for predictions of wind speed, turbulence intensity, and power capture at the turbine and wind farm levels for different wind and atmospheric conditions. ML methods for data quality control and pre-processing are applied to the data set under investigation and found to outperform standard statistical methods. A hybrid model, comprised of a linear interpolation model, Gaussian process, deep neural network (DNN), and support vector machine, paired with a DNN filter, is found to achieve high accuracy for modeling wind turbine power capture. Modifications of the incoming freestream wind speed and turbulence intensity, $TI$, due to the evolution of the wind field over the wind farm and effects associated with operating turbines are also captured using DNN models. Thus, turbine-level modeling is achieved using models for predicting power capture while farm-level modeling is achieved by combining models predicting wind speed and $TI$ at each turbine location from freestream conditions with models predicting power capture. Combining these models provides results consistent with expected power capture performance and holds promise for future endeavors in wind farm modeling and diagnostics. Though training ML models is computationally expensive, using the trained models to simulate the entire wind farm takes only a few seconds on a typical modern laptop computer, and the total computational cost is still lower than other available mid-fidelity simulation approaches.
Wind turbine wake modelling is of crucial importance to accurate resource assessment, to layout optimisation, and to the operational control of wind farms. This work proposes a surrogate model for the representation of wind turbine wakes based on a state-of-the-art graph representation learning method termed a graph neural network. The proposed end-to-end deep learning model operates directly on unstructured meshes and has been validated against high-fidelity data, demonstrating its ability to rapidly make accurate 3D flow field predictions for various inlet conditions and turbine yaw angles. The specific graph neural network model employed here is shown to generalise well to unseen data and is less sensitive to over-smoothing than common graph neural networks. A case study based upon a real-world wind farm further demonstrates the capability of the proposed approach to predict farm-scale power generation. Moreover, the proposed graph neural network framework is flexible and highly generic and, as formulated here, can be applied to any steady-state computational fluid dynamics simulation on unstructured meshes.
Removing reverb from reverberant music is a necessary technique for cleaning up audio for downstream music manipulations. Reverb in music falls into two categories: natural reverb and artificial reverb. Artificial reverb is more diverse than natural reverb because of its various parameter setups and reverberation types. However, recent supervised dereverberation methods may fail because they rely on sufficiently diverse and numerous pairs of reverberant observations and retrieved data for training in order to generalize to unseen observations during inference. To resolve these problems, we propose an unsupervised method that can remove a general kind of artificial reverb from music without requiring paired data for training. The proposed method is based on diffusion models: it initializes the unknown reverberation operator with a conventional signal processing technique and simultaneously refines the estimate with the help of diffusion models. We show through objective and perceptual evaluations that our method outperforms the current leading vocal dereverberation benchmarks.